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Abstract 

This  paper  studies  the  long-existing  idea  of  adding  a  nice  smooth  function  to  “smooth”  a  non- 
differentiable  objective  function  in  the  context  of  sparse  optimization,  in  particular,  the  minimization  of 
||  x|| 1  +  y||x||l,  where  x  is  a  vector,  as  well  as  the  minimization  of  ||X||*  +  -T  ||X||J.,  where  X  is  a  matrix 
and  || X || *  and  ||X||f  are  the  nuclear  and  Frobenius  norms  of  X,  respectively.  We  show  that  they  let 
sparse  vectors  and  low-rank  matrices  be  efficiently  recovered.  In  particular,  they  enjoy  exact  and  stable 
recovery  guarantees  similar  to  those  known  for  the  minimization  of  ||x||i  and  ||X||*  under  the  conditions 
on  the  sensing  operator  such  as  its  null-space  property,  restricted  isometry  property,  spherical  section 
property,  or  “RIPless”  property.  To  recover  a  (nearly)  sparse  vector  x°,  minimizing  ||x||i  +  ^||x|||  returns 
(nearly)  the  same  solution  as  minimizing  ||x||i  whenever  a  >  10|| x° || oo -  The  same  relation  also  holds 
between  minimizing  ||X||»  +  ^||X||^  and  minimizing  |jX||*  for  recovering  a  (nearly)  low-rank  matrix  X° 
if  a  >  10 1| X° ||  2 -  Furthermore,  we  show  that  the  linearized  Bregman  algorithm,  as  well  as  its  two  fast 
variants,  for  minimizing  ||x||i  +  ^||x|||  subject  to  Ax  =  b  enjoys  global  linear  convergence  as  long  as  a 
nonzero  solution  exists,  and  we  give  an  explicit  rate  of  convergence.  The  convergence  property  does  not 
require  a  solution  or  any  properties  on  A.  To  our  knowledge,  this  is  the  best  known  global  convergence 
result  for  first-order  sparse  optimization  algorithms. 


1  Introduction 

Sparse  vector  recovery  and  low-rank  matrix  recovery  problems  have  drawn  lots  of  attention  from  researchers 
in  different  fields  in  the  past  several  years.  They  have  wide  applications  in  compressive  sensing,  signal/image 
processing,  machine  learning,  etc.  The  fundamental  problem  of  sparse  vector  recovery  is  to  find  the  vector 
with  (nearly)  fewest  nonzero  entries  from  an  underdetermined  linear  system  Ax  =  b,  and  that  of  low-rank 
matrix  recovery  is  to  find  a  matrix  of  (nearly)  lowest  rank  from  an  underdetermined  A(X)  =  b,  where  A  is 
a  linear  operator. 

To  recover  a  sparse  vector  x°,  a  well-known  model  is  the  basis  pursuit  problem  [10]: 

min{||x||i  :  Ax  =  b}.  (1) 

X 

For  vector  b  with  noise  or  generated  by  an  approximately  sparse  vector,  a  variant  of  (1)  is 

min{||x||i  :  ||  Ax  -  b||2  <  er}.  (2) 

X 
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To  recover  a  low-rank  matrix  X°  £  R"lXn2  from  linear  measurements  b  =  _4(X°),  which  stand  for  bi  = 
trace(A^ X°)  for  a  given  matrix  A.;  £  K”lX™2,  i  =  1,2, ...  ,m,  a  popular  approach  is  the  convex  model  (cf. 
[13,  8,  35]) 

min{||X|U  :A(X)  =  b},  (3) 

where  ||X||*  equals  the  summation  of  the  singular  values  of  X.  Similar  to  (2),  a  useful  variant  of  (3)  is 

mm{||X||*:||A(X)-b||2<a},  (4) 

The  nonsmooth  objective  functions  in  problems  (l)-(4)  pose  numerical  challenges.  We  augment  or 
“smooth”  them  by  adding  y-||x|||  or  W||x|||,,  where  a  is  a  positive  scalar.  We  argue  that  minimizing 
the  augmented  objective  ||x||i  +  A;||x|||,  as  well  as  ||X||*  +  ^||X|||,,  leads  to  fast  numerical  algorithms 
because  not  only  accurate  solutions  can  be  obtained  by  using  a  sufficiently  large,  yet  not  very  large,  a  but 
the  Lagrange  dual  problems  are  also  continuously  differentiable  and  subject  to  gradient-based  acceleration 
techniques  such  as  line  search.  Next,  we  briefly  review  the  related  works  and  summarize  the  contributions 
of  this  paper. 

The  augmented  model  for  (1)  is 

mm  j||x||i  +  ^-||x||2  :  Ax  =  b|,  (5) 

which  can  be  solved  by  the  linearized  Bregman  algorithm  (LBreg)  [40],  which  is  analyzed  in  [3,  39].  (Note 
that  LBreg  is  different  from  the  Bregman  algorithm  [31,  40],  which  solves  problem  (1)  instead  of  (5).) 

The  exact  regularization  property  of  (5)  is  proved  in  [39]:  the  solution  to  (5)  is  also  a  solution  to  (1)  as 
long  as  a  is  sufficiently  large.  The  property  can  also  be  obtained  from  [16].  However,  neither  paper  tells 
how  to  select  a,  whereas  the  size  of  a  affects  the  numerical  performance.  It  has  been  observed  by  several 
groups  of  researchers  that  a  larger  a  tends  to  cause  slower  convergence.  Hence,  one  would  like  to  choose  a 
moderate  a  that  is  just  large  enough  for  (5)  to  return  a  solution  to  (1).  For  recovering  a  sparse  vector  x° 
and  a  low-rank  matrix  X°,  this  paper  gives  the  simple  formulae 

a  >  10||x°||oo  and  a  >  10||X° ||2, 

respectively,  where  the  operator  norm  ||  X°  ||  2  equals  the  maximum  singular  value  of  X°.  Although  x°  and  X° 
are  not  known  when  a  must  be  set,  Hx0]^  and  || X° || 2  are  often  easy  to  estimate.  For  example,  in  compressive 
sensing,  ||x0||oo  is  the  maximum  intensity  of  the  underlying  signal  or  the  maximum  sensor  reading.  When 
the  total  energy  ||x°||2  is  roughly  known,  one  can  apply  the  more  conservative  formula:  a  >  10||x°||2  since 
1 1 x° 1 1 2  >  || x° || 00 -  Similarly,  a  more  conservative  formula  is  a  >  10|| X° || jr  for  the  matrix  case. 

This  paper  also  shows  that  the  Lagrange  dual  problem  of  (5)  is  unconstrained  and  differentiable,  and  its 
objective  is  uniformly  strongly  convex  when  restricted  to  certain  pairs  of  points.  Consequently,  algorithm 
LBreg,  as  well  as  two  faster  variants,  enjoys  global  linear  convergence;  specifically,  both  the  objective  error 
and  solution  error  are  bounded  by  0(p,k),  where  k  is  the  iteration  number  and  /x  is  a  constant  strictly  less 
than  1.  The  value  of  /x  depends  on  a,  the  dynamic  range  of  the  solution’s  nonzero  entries,  as  well  as  some 
properties  of  A.  Although  several  first-order  algorithms  for  (1)  have  been  shown  to  have  asymptotic  linear 
convergence,  no  global  linear  convergence  with  an  explicit  rate  has  previously  been  given  for  any  first-order 
sparse  optimization  algorithms,  to  the  best  of  our  knowledge. 

LBreg  without  acceleration  is  not  very  efficient  because  it  is  equivalent  to  the  dual  gradient  ascent  with 
a  fixed  step  size,  as  shown  in  [39].  Nonetheless,  the  step  size  can  be  relaxed.  Since  the  augmentation 
term  W||x||2  makes  the  dual  problem  unconstrained  and  differentiable,  the  dual  is  subject  to  advanced 
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gradient-descent  techniques  such  as  Barzilai-Borwein  (BB)  step  sizes  [1],  non-monotone  line  search,  Nes¬ 
terov’s  technique  [29],  as  well  as  semi-smooth  Newton  methods.  Indeed,  LBreg  has  been  improved  in  several 
recent  works:  [32]  applies  a  kicking  trick;  [39]  considers  applying  BB  step  sizes  and  non-monotone  line  search, 
as  well  as  the  limited  memory  BFGS  method  [25];  [38]  applies  the  alternating  direction  method  to  the  La¬ 
grange  dual  of  (5);  [22]  applies  Nesterov’s  technique  [29]  and  obtains  the  convergence  rate  0(\/k2).  Based 
on  the  restricted  strong  convexity  of  the  dual  objective  and  some  existing  proofs,  we  theoretically  show  and 
numerically  demonstrate  that  LBreg  with  BB  step  sizes  with  non-monotone  line  search  also  enjoys  global 
linear  convergence. 

LBreg  has  also  been  extended  to  recovering  simply  structured  matrices.  The  algorithms  SVT  [2]  for 
matrix  completion  and  IT  [37]  for  robust  principal  components  are  of  the  LBreg  type,  namely,  they  are 
gradient  iterations  that  solve 

mm{||X|U  +  ^||X||!  :  Xy  =  My,  V(i,j)  €  fl},  (6) 

min{||L||*  +  A||S||i  +  ^  (||L||2F  +  ||S||2F)  :  L  +  S  =  D},  (7) 

respectively,  where  f 2  is  the  set  of  the  observed  matrix  entries  and  ||S||i  =  Yhi  j  [41]  shows  that  the 

exact  regularization  property  for  the  vector  case  also  holds  for  (6)  and  (7).  Although  this  paper  does  not 
analyze  (6)  and  (7)  specifically,  it  gives  recovery  guarantees  for  models 

inin|||X|U  +  ^||X||2F:A(X)  =  bJ  (8) 

and 

mm  j[[X||*  +  ^||X||2f  :  ||A(X)  -  b|[2  <  aj  (9) 

assuming  a  >  10||Xo||2. 

1.1  Organization 

The  rest  of  this  paper  is  organized  as  follows.  Section  2  presents  several  models  with  augmented  t\  or 
augmented  nuclear-norm  objectives  and  derives  their  Lagrange  dual  problems.  The  exact  and  stable  recovery 
conditions  for  these  models  are  given  in  Section  3.  Section  4  proves  a  restricted  strongly  convex  property 
and  establishes  global  linear  convergence  for  LBreg  and  its  two  faster  variants.  The  materials  of  Sections  3 
and  4  are  technically  independent  of  each  other,  yet  they  are  two  important  sides  of  model  (5). 

2  Augmented  £\  and  nuclear-norm  models 

This  section  presents  the  primal  and  dual  problems  of  a  few  augmented  t\  and  augmented  nuclear-norm 
models. 

Equality  constrained  augmented  i\  model:  Since  ||x||i  =  max{xTz  :  z  €  M",  ||z||oo  <  1},  the  dual 
problem  of  (5)  can  be  obtained  as  follows 

min{||x||i  +  -^||x|||  :  Ax  =  b}  =  minmax ||x||i  +  -'-||x|||  -  yT(Ax-  b) 
x  Za  x  y  Za 

=  min max{xTz  +  ^-||x|||  -yTAx  +  yTb  :  Hz)^  <  1} 

X  y,z  Za 

=  max{minxTz+  ^-||x|||  -yTAx  +  bTy  :  Hz^  <  1} 
y,z  x  Za 

=  min{— bTy  +  ^||ATy  -  z|||  :  HzH^  <  1},  since  x*  =  o(ATy  -  z). 
y,z  Z 
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Eliminating  z  from  the  last  equation  gives  the  following  dual  problem. 

nun  — bTy  +  |||ATy  -  Proj[_ljl]„(ATy)||2.  (10) 

For  any  real  vector  z,  we  have  z  —  Proj^  /i]„(z)  =  shrinkAt(z),  where  shrink^  is  the  well-known  shrinkage  or 
soft-thresholding  operator  with  parameter  /r  >  0.  We  omit  p.  when  /i  =  1.  Hence,  the  second  term  in  (10) 
equals  (a/2)||  shrink(ATy)|||. 

It  is  interesting  to  compare  (10)  with  the  Lagrange  dual  of  (1): 

min{— bTy  :  ||ATy||oo  <  1}.  (11) 

y 

Instead  of  confining  each  component  of  ATy  to  [—1, 1],  (10)  applies  quadratic  penalty  to  the  violation.  This 
leads  to  its  advantage  of  being  unconstrained  and  differentiable  (despite  the  presence  of  projection). 

The  gradient  of  the  last  term  in  (10)  is  aA  shrink(ATy).  Furthermore,  given  a  solution  y*  to  (10),  one 
can  recover  the  solution  x*  =  a  shrink(ATy*)  to  (5)  (since  (10)  has  a  vanishing  gradient  Ax*  —  b  =  0,  and 
x*  and  y*  lead  to  0-gap  primal  and  dual  objectives,  respectively).  Therefore,  solving  (10)  solves  (5),  and 
it  is  easier  than  solving  (1).  In  particular,  (10)  enjoys  a  rich  set  of  classical  techniques  such  as  line  search, 
Barzilai-Borwein  steps  [1],  semi-smooth  Newton  methods,  Nesterov’s  acceleration  [29],  which  do  not  directly 
apply  to  problems  (1)  or  (11). 

Norm-constrained  augmented  For  model  (2),  the  primal  and  dual  augmented  models  are 

mm|||x||i  +  gjllxll2  :  l|Ax-  bj|2  <  crj  ,  (12) 

nun  {— bTy  -f-  cr||y || 2  +  “||ATy  -  Proj[_lil]„(ATy)||^| .  (13) 

The  objective  of  (13)  is  differentiable  except  at  y  =  0.  However,  this  is  not  an  issue  since  y  =  0  is  a  solution 
to  (13)  only  if  x  =  0  is  the  solution  to  (12).  In  other  words,  (13)  is  practically  differentiable  and  thus  also 
amenable  to  classical  gradient-based  acceleration  techniques. 

Equality-constrained  augmented  ||  •  ||*:  The  primal  and  dual  of  the  augmented  model  of  (3)  are  (8) 
and 

mm  {-bTy  +  |||A*y  -  Proj{X: ||x||2 <i} (^y ) II I’ }  .  (14) 

respectively,  where  A* y  :=  X)[=i  2 /*Aj  and  {X  :  ||X|j2  <  1}  is  the  set  of  ni-by-n2  matrices  with  spectral 
norms  no  more  than  1.  In  (14),  inside  the  Frobenius  norm  is  the  singular  value  soft-thresholding  [2]  of  A*  y. 
The  primal  and  dual  of  the  augmented  model  (4)  are  (9)  and 

myin  {-bTy  +  °-llyll2  +  f  M*y  -  Proj{x:||xn2<i}(-^*y)llF}  ,  (is) 

respectively.  Like  the  augmented  models  for  vectors,  problems  (14)  and  (15)  are  practically  differentiable 
and  thus  also  amenable  to  advanced  optimization  techniques  for  unconstrained  differentiable  problems. 

As  one  can  see,  it  is  a  routine  task  to  augment  an  fi-like  minimization  problem  and  obtain  a  problem  with 
a  strongly  convex  objective,  as  well  as  its  Lagrange  dual  with  a  differentiable  objective  and  no  constraints. 
One  can  augment  models  with  a  transform-!?!  objective,  total  variation,  ^ij2  or  ldj00  norms  (for  joint  or  group 
sparse  signal  recovery),  robust-PCA  objective,  etc.  Since  the  dual  problems  are  convex  and  differentiable, 
they  enjoy  a  rich  set  of  gradient-based  optimization  techniques. 
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3  Recovery  Guarantees 


This  section  establishes  recovery  guarantees  for  augmented  l\  models  (5)  and  (12)  and  extend  these  results 
to  matrix  recovery  models  (8)  and  (9).  The  results  for  (5)  and  (12)  are  given  based  on  a  variety  of  properties 
of  A  including  the  null-space  property  (NSP)  in  Theorem  1,  the  restricted  isometry  property  (RIP)  [9]  in 
Theorems  2  and  3,  the  spherical  section  property  (SSP)  [44]  in  Theorems  4  and  5,  and  an  “RIPless”  condition 
[6]  in  Theorem  6  below.  We  choose  to  study  all  these  different  properties  since  they  give  different  types  of 
recovery  guarantees  and  apply  to  different  type  of  matrices.  Other  than  that  NSP  is  used  in  our  proofs  for 
RIP  and  SSP,  the  other  three  properties  —  RIP,  SSP,  and  RIPless  —  do  not  dominate  one  another  in  terms 
of  usefulness.  They  together  assert  that  a  large  number  of  matrices  such  as  those  sampled  from  subgaussian 
distributions,  Fourier  and  Wash-Hadamard  ensembles,  and  random  Toeplitz  and  circulant  ensembles  are 
suitable  for  sparse  vector  recovery  by  models  (5)  and  (12). 

First,  we  present  some  numerical  simulations  to  motivate  the  subsequent  analysis. 

3.1  Motivating  examples 

We  are  interested  in  comparing  model  (5)  to  model  (1),  whose  the  performance  on  recovering  sparse  solutions 
have  been  widely  studied.  To  this  end,  we  conducted  three  sets  of  simulations.  Without  loss  of  generality, 
we  fixed  ||x0||oo  =  1  and  solved  (1)  and  then  (5)  with  a  =  1, 10,  and  25  to  reconstruct  signals  of  n  =  400 
dimensions.  We  set  the  signal  sparsity  k  =  1,  2, . . . ,  80  and  the  number  of  measurements  m  =  40, 41, . . . ,  200. 
The  entries  of  A  were  sampled  from  the  standard  Gaussian  distribution. 

It  turns  out  that  the  recovery  performance  of  (5)  depends  on  the  decay  speed  of  the  nonzero  entries  of 
the  signal  x°.  So,  we  tested  three  decay  speeds:  (i)  flat  magnitude  —  no  decay,  (ii)  independent  Gaussian 

moderate  decay,  and  (iii)  power  law  —  fast  decay.  In  the  power-law  decay,  the  ith  largest  entry  had 
magnitude  i~ 2  and  a  random  sign. 

For  each  (to,  k),  100  independent  tests  were  run,  and  the  average  of 

recovery  relative  error  ||x*  —  x° ||2/||x°  ||2  (16) 

was  recorded,  where  x*  stands  for  a  solution  of  either  (1)  or  (5).  The  slightly  smoothed  cut-off  curves  at 
two  different  levels  of  relative  errors  are  depicted  in  Figure  1.  Above  each  curve  is  the  region  where  a  model 
fails  to  recover  the  signals  to  the  specified  average  relative  error.  Hence,  a  higher  curve  means  fewer  fails 
and  thus  better  recovery  performance. 

We  can  make  following  observations. 

•  In  all  tests,  the  best  curve  is  from  BP  or  model  (1).  Closely  following  it  are  those  of  a  =  25  and  a  =  10 
of  model  (5).  As  long  as  a  >  10,  model  (5)  is  as  good  as  model  (1)  up  to  a  negligible  difference. 

•  The  curve  of  a  =  1  is  noticeably  lower  than  others  when  the  signal  is  flat  or  decays  slowly.  For  this 
reason,  we  do  not  recommend  using  a  =  ||x° ||c»  f°r  model  (5)  unless  when  the  underlying  signals  decay 
very  fast. 

•  The  differences  of  the  fours  curves  are  very  similar  across  the  two  levels  10~3  and  10-5  of  relative  errors. 
We  tested  other  levels  and  found  the  same.  Therefore,  the  performance  differences  are  independent  of 
the  error  level  chosen  to  plot  the  curves. 

Some  expert  readers  may  know  that  in  theory,  given  matrix  A,  whether  or  not  model  (1)  can  exactly 
recover  x°  solely  depends  on  sign(x°),  independent  of  its  decay  speed.  So,  one  may  wonder  why  the  BP 
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1e-3  relative  error  curves 


1e-3  relative  error  curves 


1e-3  relative  error  curves 


(a)  Flat  and  10  3 


1  e-5  relative  error  curves 


(d)  Flat  and  10-5 


(b)  Gaussian  and  10  3 


1  e-5  relative  error  curves 


(e)  Gaussian  and  10  5 


(c)  Power- law  and  10  3 


1  e-5  relative  error  curves 


(f)  Power- law  and  10  5 


Figure  1:  Curves  of  specified  recovery  relative  errors  of  model  (1)  (BP)  and  model  (5)  with  a  = 
1, 10,  25.  Above  each  curve  is  the  region  where  a  model  fails  to  achieve  the  specified  average  relative  error. 
A  higher  curve  means  better  recovery  performance. 

curves  are  not  the  same  across  different  plots.  That  is  because,  when  (1)  fails  to  recover  x°,  the  relative  error 
depends  on  the  decay  speed;  a  faster  decaying  signal,  when  not  exactly  recovered,  tends  to  have  a  smaller 
error.  This  is  why  at  the  error  level  10— 3 ,  the  BP  curve  is  obviously  higher  (better)  on  the  faster-decaying 
signals. 

3.2  Null  space  property 

Matrix  A  satisfies  the  NSP  if 


llhs||l  <  llhscl|l>  (17) 

holds  for  all  h  £  Null(A)  and  coordinate  sets  S  C  {1,2,  ,n}  of  cardinality  |<S|  <  k.  If  so,  problem  (1) 

recovers  all  /c-sparse  vectors  x°  from  measurements  b  =  Ax°.  The  NSP  is  also  necessary  for  exact  recovery 
of  all  /c-sparse  vectors  uniformly.  The  wide  use  of  NSP  can  be  found  in,  e.g.,  [11,  18,  43].  Note  that  it  holds 
regardless  the  value  of  ||x0||oo.  We  now  give  a  necessary  and  sufficient  condition  for  problem  (5). 

Theorem  1  (NSP  condition).  Assume  Ijx0]^  is  fixed.  Problem  (5)  uniquely  recovers  all  k-sparse  vectors 
x°  with  the  fixed  Hx0]^  from  measurements  b  =  Ax°  if  and  only  if 

(1  +  ^cT^)l|hslll-|!hsc|11’  (18) 

holds  for  all  vectors  h  £  Null(A)  and  coordinate  sets  S  of  cardinality  |<S|  <  k. 


6 


Proof.  Sufficiency:  Pick  any  fc-sparse  vector  x°.  Let  S  :=  supp(x°)  and  Z  =  Sc.  For  any  nonzero 
h  £  Null(A),  we  have  A(x°  +  h)  =  Ax°  =  b  and 


X°  +  h||1  +  — ||x°-hh||l  =  ||x°  +  hsW,  +  -||x°  +^111  +  1111*11!  +  ^-\\hz\\l 


2  a 


> 


2a 


2a 1 


>  Il4lli  -  Mi  +  -114112  +  ^(4,hs)  +  ^iihsiii 


|h2||1  +  — IlhzlH 


I4lli  + 


h 


4112 


I4|1  +  -„x 


0 1|  2 
2 


a 


S  111 


2a1 


IMi-  1 


■*5111 


2a 


(19) 


where  the  first  inequality  follows  from  the  triangle  inequality,  and  the  second  follows  from  |jhs|||  +  || h^: |||  = 
INI  and  <x°  ,h5)  >  -||x°  lUUhsll,  =  - 1| x° || _ || || r - 

Since  || h|||  >  0,  ||x°  +  h||i  +  ^||x°  +  h||2  is  strictly  larger  than  ||x°||i  +  ^ 1 1 x° 1 1 2  provided  that  the  second 
block  of  (19)  is  nonnegative.  Hence,  condition  (18)  is  sufficient  for  x°  to  be  the  unique  minimizer  of  (5)  . 

Necessity:  It  is  sufficient  to  show  that  for  any  given  nonzero  h  £  Null(A)  and  S  satisfying  |<S|  <  k, 
we  can  to  identify  a  fc-sparse  x°  such  that  (18)  is  necessary  for  its  exact  recovery.  To  this  end,  we  define 
x°  as  x®  =  — sign(/ii)||h||00  for  i  £  S  and  x°  =  0  for  j  £  Sc,  and  scale  x°  to  have  the  specified  ||x0||oo. 
Under  this  construction,  we  have  the  following  properties:  ||x° ||0  <  k,  ||x^  +  rhs||i  =  ||x^||i  —  || 7~h^- 1|  1 ,  and 
(x^,rh5)  =  —  llxjjlloollThslli,  for  any  0  <  r  <  1.  Now,  we  let  rh  replace  h  in  the  equation  array  (19)  and 
observe  that  both  of  the  two  inequalities  of  (19)  now  hold  with  equality.  Therefore,  since  the  exact  recovery 
of  x°  requires  ||x°  +  rh||i  +  ^||x°  +Th|||  >  ||x°||i  +  ^||x°|||,  it  also  requires 

Ikhzll,  -  (l  +  !^)  ||Ths||, 


s;l|Thb 


2  >  0 


(20) 


for  all  0  <  r  <  1,  which  in  turn  requires  (18)  to  hold. 


□ 


Remark  1.  For  any  finite  a  >  0,  (18)  is  stronger  than  (17)  due  to  the  extra  term 


Since  various 


uniform  recovery  results  establish  conditions  that  guarantee  (17),  one  can  tighten  these  conditions  so  that 
they  guarantee  (18)  and  thus  the  uniform  recovery  by  problem  (5).  How  much  tighter  these  conditions  have 
to  be  depends  on  the  value  ^x,s^°°  . 


3.3  Restricted  isometry  property 

In  this  subsection,  we  first  review  the  RIP-based  sparse  recovery  guarantees  and  then  show  that  given  certain 
RIP  conditions,  any  a  >  10||xo||2  guarantees  exact  and  stable  recovery  by  (5)  and  (12),  respectively. 

Definition  1.  [9]  The  RIP  constant  5k  of  matrix  A  is  the  smallest  value  such  that 

(i-4)l|x||l<||Ax||2<(i  +  4)||x||2  (21) 

holds  for  all  k-sparse  vectors  x  £  Rn. 

For  (1)  to  recover  any  fc-sparse  vector  uniformly,  [5]  shows  the  sufficiency  of  <  0.4142,  which  is  later 
improved  to  S2k  <  0.4531  [15],  d2k  <  0.4652  [14],  <52fc  <  0.4721  [4],  as  well  as  S2k  <  0.4931  [27].  The  bound 
is  still  being  improved.  Adapting  results  in  [27],  we  give  the  uniform  recovery  conditions  for  (5)  below. 

Theorem  2  (RIP  condition  for  exact  recovery).  Assume  that  x°  £  Mn  is  k-sparse.  If  A  satisfies  RIP  with 
52k  <  0.4404  and  a  >  10||x°||oo,  then  x°  is  the  unique  minimizer  of  (5)  given  measurements  b  :=  Ax°. 


7 


Proof.  Let  S  :=  supp(x°)  and  Z  :=  Sc.  Theorem  3.1  in  [27]  shows  that  any  h  £  Null(A)  satisfies 


||hsl|i  <  02fc||hz[|i> 


where 


Hence,  (18)  holds  provided  that 


®1k  '■  = 


4(1  +  5<?2fc  —  4(5  |fc) 
(1  —  <52fe)  (32  —  25<52fc) 


>  @2  k 


or,  in  light  of  0-2k  <  1, 

a  —  (®2k  ~ 

For  S2k  =  0.4404,  we  obtain 


,y  1  no  ,  _ _ ||xQ||oo- y/4(l  +  5(52fc-4diJ _ 

1  °°  Vi1  ~  ^2fe)(32  -  25<52fc)  -  \J 4(1  +  5 S2k  -  4<5|fc) ' 

—  l)  ||x° Hoc  «  9.9849||x0||oo  <  a,  which  proves  the  theorem. 


(22) 


(23) 

□ 


Remark  2.  Different  values  of  S2k  are  associated  with  different  conditions  on  a.  Following  (23),  if  62k  < 
0.4715,  a  >  25||x0||oo  guarantees  exact  recovery.  If  S2k  <  0.1273,  a  >  || x° || oo  guarantees  exact  recovery.  In 
general ,  a  smaller  S2k  allows  a  smaller  a. 


Next  we  study  the  case  where  b  is  noisy  or  x°  is  not  exactly  sparse,  or  both.  For  comparison,  we  present 
two  inequalities  next  to  each  other  for  problems  (2)  and  (5)  each,  where  the  first  one  is  easy  to  obtain;  see 
[5]  for  example. 

Lemma  1.  Let  x°  £  R"  be  an  arbitrary  vector,  S  be  the  coordinate  set  of  its  k  largest  components  in 
magnitude,  and  Z  :=  {1,  •  •  •  ,  n}  \  S .  Let  x*  and  x*  be  the  solutions  of  (2)  and  (12),  respectively.  The  error 
vectors  h  =  x*  —  x°  and  h  =  x*  —  x°  satisfy 


Mi<  l!Mi+  211x111!, 
I|hz||i  <  C*3 1 1  1 1 !  +  QHxllli, 

where  Ijx^Hi  is  the  best  k-term  approximation  error  of  x°  and 


C 3  := 


ss  n°° 


a  -  x' 


and  C4  := 


2a 


z  ll°° 


a  -  x' 


■2: 1 1 00 


Proof.  We  only  show  (25).  Since  x*  =  x°  +  h  is  the  minimizer  of  (12),  we  have 


hll 1  +  ^llX°  +  hlll  <  llX°l|l  +  d^llX°ll2- 


(24) 

(25) 


(26) 


(27) 


Also. 


x°  +  h||x  +  —  ||x°  +  hill  =  ||x°  +  hsh  +  -||x°  +  h^lll  +  ||x°  +  hzlU  +  -I|x|  +  h^lll 


2a1 


2a1 


>  Il4lli  -  Mi  +  ^||x°  ||i  -  1  |(x°,h5)|  +  ±\\hs\\l 

la  a  la 


+  ||h2||1-||xi||1  + 


^~\Wz\\l  -  -|(x|,h2)|  +  ^-\\hz\\l 
la  a  la 


|x°||i 


1  "-0||^)  -  211x111! -(||h5||1+1|(x°,h5)|) 


2a 


+(||h2||1--|(x|,h2)|)  +  ^||h||i 
a  la 

>  (Hx0!!!  +  J-||x°||l)  -  211x9,11!  -f1  + 


2a 


a 


Ssiiqq 


Mi 


+  1- 


1 


IIMl  +  2^l|h|1*’ 


(28) 


where  the  first  inequality  follows  from  the  triangle  inequality,  and  the  second  from  (a,  b)  <  ||a| 
Combining  (27)  and  (28),  we  obtain 


1  - 


'■Zlloo 


<  1  + 


SSlIoo 

a 


l|h5||i+2||x' 


■z  111 


and  thus  (25)  after  dropping  the  nonnegative  term  1 1  h|| 2 . 
We  now  present  the  stable  recovery  guarantee. 


□ 


Theorem  3  (RIP  condition  for  stable  recovery).  Assume  the  setting  of  Lemma  1.  Let  b  :=  Ax°  +  n,  where 
n  is  an  arbitrary  noisy  vector  with  || n|| 2  <  er.  If  A  satisfies  RIP  with  62k  <  0.3814,  then  the  solution  x*  of 
(12)  with  any  a  >  10||  x°  ||  oo  satisfies 


||x*  —  x°||i  <C\  ■  >/fc||n||2  +  C2  •  ||x^ ||  1 ,  (29) 

l|x*  -  x°||2  <Ci  •  || n|| 2  +  C2  •  \\x°z\\i/Vk,  (30) 


where  C\,  C2,  Ci,  and  C2  are  given  in  (33a) -(34b)  as  functions  of  only  62k,  C3,  and  C4  in  (26). 

Proof.  We  follow  an  argument  similar  to  that  in  [27].  According  to  Lemma  4.3  of  [27],  from  || Ah|| 2  = 
|| Ax*  -  Ax°||2  =  || Ax*  -  b  +  n||2  <  ||  Ax*  -  bj|2  +  ||n||2  <  2||n||2  and  S2k  <  2/3,  we  obtain 

IMi<  ^^Vfc||n||2  +  02fc||h2||i,  (31) 

where  d2k  is  defined  in  (22)  as  a  function  of  62k-  It  is  easy  to  verify  that  with  the  choice  of  62k  <  0.3814  and 
a  in  the  theorem,  0362k  <  1  holds  for  all  nonzero  x°.  Hence,  combining  (25)  of  Lemma  1  and  (31)  yields 
the  bound  of  ||hz||i: 

|IMi<  (1- 0362k)-1  fc3^|^=\/fc||n||2  +  C-4||x“||1J  .  (32) 

Applying  (31)  and  (32)  gives  us  (29)  or 

||x*  —  x°||i  =  || h||  1  =  ||h5||i  +  || ||  1  <  Ci x/fc 1 1 n 1 1 2  +  C^x0  -  afc(a;0)||i, 
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where 


2y/2(l  —  0362k  +  C3) 

\/l  _  ^2fe(l  —  0362k) 


1  —  (7302*; 


To  prove  (30),  we  apply  (32)  to  the  inequality  (Page  7  of  [27]) 


< 


a/1  —  62  k 


8(2  -  d2fc) 


(l-d2fe)(32  -  25<52fc) 


llhzlli 

Vk 


and  obtain  (30)  or 

Ik*  -z°l|2  =  ||h||2  <  Ci||n||2  +  <72||x°  -  x^/Vk, 

where 

C  ■=  2  (  4g3  /  2-g2fe  \ 

1  '  v/T^2^  \1-C362k\  (l-<52fc)(32-  25<52fe)  ^  ’ 

^  ,=  2C4  /  2(2  -  d2fc) 

1  —  (7302fc  Y  (1  —  i52fe)(32  —  25S2k) 


(33a) 

(33b) 


(34a) 

(34b) 

□ 


Remark  3.  4  fcey  inequality  in  the  proof  above  is  0362k  <  1,  where  C3  (cf.  (26) )  depends  on  a,  Hx^H^, 
and  1 1 1 1 00 ?  and  62k  (cf.  {22))  depends  on  62k-  If  the  nonzeros  o/x°  decay  faster  in  magnitude,  C3  becomes 
smaller  and  thus  the  condition  C302fc  <  1  is  easier  to  hold.  Therefore,  a  faster  decaying  x°  is  easier  to 
recover.  This  is  consistent  with  the  numerical  simulation  in  subsection  3.1.  In  Theorem  3,  the  condition  on 
62k  and  bound  on  a  are  given  for  the  worst  case  corresponding  to  no  decay,  namely ,  Hx^H^  =  ||x^||oo-  If 
1 1 x(l  1 1 00  >  ||x!z||oo.  one  can  allow  a  larger  62k  for  each  fixed  a  or,  equivalently,  a  smaller  a  for  each  fixed  62k- 
For  example,  i/Hx^H^  >  lOljx^Hooj  one  only  needs  62k  <  0.4348  instead  of  the  theorem-assumed  condition 
a  <  0.3814. 

There  is  also  a  trade-off  between  62k  and  a.  Under  the  worst  case  Hx^H^  =  llx'gllooj  imposing  to 
a  >  25 1 1 x°  | loo  leads  to  the  relaxed  condition  62k  <  0.4489. 


3.4  Spherical  section  property 


Next,  we  derive  exact  and  stable  recovery  conditions  based  on  the  spherical  section  property  (SSP)  [44,  36] 
of  A,  which  has  the  advantage  of  invariance  to  left-multiplying  nonsingular  matrices  to  the  sensing  matrix 
A,  as  pointed  out  in  [44].  On  the  other  hand,  more  matrices  are  known  to  satisfy  the  RIP  than  the  SSP. 


Definition  2  (A-SSP  [36]).  Let  m  and  n  be  two  integers  such  that  m  >  0,  n  >  0,  and  m  <  n.  An  ( n  —  m) 
dimensional  subspace  V  C  Rn  has  the  A  spherical  section  property  if 


|jh||i 

l|h||2  - 


(35) 


holds  for  all  nonzero  h  £  V. 

To  see  the  significance  of  (35),  we  note  that  (i)  >  2 \fk  for  all  h  £  Null(A)  is  a  sufficient  condition  for 

the  NSP  inequality  (17)  and  (ii)  due  to  [23,  17],  a  uniformly  random  (n  —  m)-dimensional  subspace  V  C  Rra 
has  the  SSP  for 

A  =  Co(log(n/m)  +  1) 
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with  probability  at  least  1  —  exp(C4(?z  —  m)),  where  Co  and  Ci  are  universal  constants.  Hence,  m  >  4fcA 
guarantees  (17)  to  hold,  and  furthermore,  if  Null(A)  is  uniformly  random,  m  =  0(k\og(n/m))  is  sufficient 
for  (17)  to  hold  with  overwhelming  probability  [44,  36].  These  results  can  be  extended  to  the  augmented 
model  (5). 

Theorem  4  (SSP  condition  for  exact  recovery).  Suppose  Null(A)  satisfies  the  A-SSP.  Let  us  fix  ||x° Hoc  and 
a  >  0.  If 

2+MuyfcAi  (36) 

a  J 

then  the  null-space  condition  (18)  holds  for  all  h  £  Null(A)  and  coordinate  sets  S  of  cardinality  |<S|  <  k.  By 
Theorem  1,  (36)  guarantees  that  problem  (5)  recovers  any  k-sparse  x°  from  measurements  b  =  Ax°. 


m  > 


Proof.  Let  S  be  a  coordinate  set  with  |S|  <  k.  Condition  (18)  is  equivalent  to 


2  + 


SSiloo 


Mi  <  llhllr. 


Since  ||h^ || i  <  \/fc||h^ ||2  <  \/fc||h||2,  (37)  holds  provided  that 

'2  + 


IN, 


a 


Ihllo  ’ 


which  itself  holds,  in  light  of  (35),  provided  that  (36)  holds. 


(37) 


(38) 

□ 


Now  we  consider  the  case  Ax°  =  b  where  x°  is  an  approximately  sparse  vector. 

Theorem  5  (SSP  condition  for  stable  recovery).  Suppose  Null(A)  satisfies  the  A-SSP.  Let  x°  £  Rn  be  an 
arbitrary  vector ,  S  be  the  coordinate  set  of  its  k  largest  components  in  magnitude,  and  Z  :=  {1,  •  •  •  ,  n}  \  S. 
Let  a  >  0  in  problem  (5).  Let  C3  and  C4  be  defined  in  (26),  which  depend  on  a.  If 

to  >  4(1  +  C3)2  kA,  (39) 


then  the  solution  x*  of  (5)  satisfies 

||x*-x°||1<4C4||x|||1) 

where  Hx^Hi  is  the  best  k-term  approximation  error  of  x° . 

Proof.  Let  h  =  x*  —  x°  £  Null  (A).  Let 

II V.  II , 

C  = 


Then  (40)  is  equivalent  to 

C  <  4C4. 

Adding  [ | h.5  [ 1 4  to  (25)  and  plugging  in  (41)  gives  us 

||h||i  <  (l  +  C3)||h5||1  +  2C4C-1||h||1, 


(40) 


(41) 

(42) 

(43) 


or  (1  —  2C4C  1)||h||1  <  (1  +  C3 ) 1 1 Hi-  If  C  <  2C4,  (42)  naturally  holds.  Otherwise,  we  have  C  >  2C4  and 

i  +  Ca  im.  „  ^  (l  +  qoVfe...  M 


< 


1  -  2C4C- 


is||i  < 


1  -  2C4C- 


(44) 


Now,  combining  A-SSP  and  (39),  we  obtain 

llhj|i 
l|h||2  - 


>2(1  +  C3)  Vk, 


which  together  with  (44)  gives  (42). 


(45) 

□ 
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3.5  “RIPless”  analysis 


The  “RIPless”  analysis  [6]  gives  non-uniform  recovery  guarantees  for  a  wide  class  of  compressive  sensing 
matrices  such  as  those  with  iid  subgaussian  entries,  orthogonal  transform  ensembles  satisfying  an  incoherence 
condition,  random  Toeplitz/circulant  ensembles,  as  well  as  certain  tight  and  continuous  frame  ensembles, 
at  0(fclog(n))  measurements.  This  analysis  is  especially  useful  in  situations  where  the  RIP,  as  well  as  NSP 
and  SSP,  is  difficult  to  check  or  does  not  hold.  In  this  subsection,  we  describe  how  to  adapt  the  “RIPless” 
analysis  to  model  (5). 

Theorem  6  (RIPless  for  exact  recovery).  Let  x°  £  R"  be  a  fixed,  k-sparse  vector.  With  probability  at  least 
1  —  5 /n  —  e~P,  x°  is  the  unique  solution  to  problem  (5)  with  b  =  Ax°  and  a  >  8 1 1 x°  1 1 2  as  long  as  the  number 
of  measurements 

to  >  Co(l  +  (3)p( A)  •  fclogn, 

where  Cq  is  a  universal  constant  and  p( A)  is  the  incoherence  parameter  of  A  (see  [6]  for  its  definition  and 
values  for  various  kinds  of  compressive  sensing  matrices). 

Proof.  The  proof  is  mostly  the  same  as  that  of  Theorem  1.1  of  [6]  except  we  shall  adapt  Lemma  3.2  of  [6]  to 
Lemma  2  below  for  our  model  (5).  We  describe  the  proof  of  the  theorem  very  briefly  here.  For  any  matrix 
A  satisfying  property  (46)  in  Lemma  2,  the  golfing  scheme  [19]  can  be  used  to  construct  a  dual  vector  y 
such  that  A*y  satisfies  property  (47)  in  Lemma  2.  The  properties  (46)  and  (47)  and  the  construction  are 
exactly  the  same  as  in  [6].  Then  Lemma  2  below  lets  this  A*y  guarantee  the  optimality  of  x°  to  (12).  □ 

Lemma  2  (Dual  certificate).  Let  x°  be  given  in  Theorem  6  and  S  :=  supp(x°).  If  A  =  ^  a2  •••  a  „] 


satisfies 


||  (A!) A5)  1||2<2  and  max  ||Asai||2  <  1 

iG«Sc 


(46) 


and  there  exists  y  such  that  v  =  A*y  satisfies 


II vs  -  sign(xs)||2  <  1/4  and  ||vsc||oo  <  1/4, 


(47) 


then  x°  is  the  unique  solution  to  (5)  with  b  =  Ax°  and  a  >  8 1 1 x°  1 1 2  - 
Proof.  Let  Z  :=  Sc.  For  any  nonzero  h  £  Null(A),  we  have  Ah  =  0  and 


l|x°  +  h||1  +  -1-||x0  +  h|| 


2 

2 


llx°lli  +  y-llx0!^]  +  [<sign(xs),hs)  +  -(x£,hs)  +  ||hz||i]  +  ^-||h||f  (48) 

2a  a  2a 


Since  the  last  term  of  (48)  is  strictly  positive,  x°  is  the  unique  solution  to  (5)  provided  that 


(sign(xs),h5)  +  -(x£,h5)  +  Hh^Hr  >  0. 
a 


(49) 


Following  the  proof  of  Lemma  3.2  in  [6]  and  from  (46)  and  (47)  we  obtain 


(sign(xs),h5)  > (||hs||2  +  ||h2||i)  and  ||hz||i  >  ^||hs||2, 
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which  together  with  a  >  8 1 1  x°  1 1 2  give 

(sign(x5),h5)  +  -(x°;,h5)  +  \\hz\\i  >  (||h5||2  +  ||hz||i)  -  ■i^^||h5||2  +  ||h.z||i 

a  4  a 

>  -^llh5||2  +  ^||h^||i  -  g  1 1  hs  1 1 2 

>  gllhslb  ^  gllhslb 

=  0. 

Hence,  x°  +  h  gives  a  strictly  worse  objective  (5)  than  x°,  so  x°  is  the  unique  solution  to  (5).  □ 


3.6  Matrix  Recovery  Guarantees 

It  is  fairly  easy  to  extend  the  results  above,  except  the  “RIPless”  analysis,  to  the  recovery  of  low-rank 
matrices.  Throughout  this  subsection,  we  let  cq(X),  i  =  1,  •  •  •  ,  m  denote  the  ith  largest  singular  value  of 
matrix  X  of  rank  m  or  less,  and  let  ||X||*  :=  cq(X),  ||X||_f  :=  (!C;=i  °f  (X)) ^2,  and  ||X||2  =  cxi(X) 
denote  the  nuclear,  Frobenius,  and  spectral  norms  of  X,  respectively. 

The  extension  is  based  on  the  following  property  of  unitarily  invariant  matrix  norms. 

Lemma  3  ([21]  Theorem  7.4.51).  Let  X  and  Y  be  two  matrices  of  the  same  size.  Any  unitarily  invariant 
norm  ||  •  ||^,  satisfies 

||£(X)-£(Y)||*<||X-Y||*,  (50) 

where  £(X)  =  diag(cri(X),  •  •  •  ,  crm(X))  and  £(Y)  =  diag(<7i(Y),  •  •  •  ,  crm(Y))  are  two  diagonal  matrices. 

In  particular,  matrices  X  and  Y  obey 

m 

^|ai(X)-ai(Y)|<||X-Y|U  (51) 

i=  1 

and 

m 

^(aI(X)-ai(Y))2<  ||X-Y|||.  (52) 

i— 1 

By  applying  (51),  [34]  shows  that  any  sufficient  conditions  based  on  RIP  and  SSP  of  A  for  recovering 
sparse  vectors  by  model  (1)  can  be  translated  to  sufficient  conditions  based  on  similar  properties  of  A  for 
recovering  low-rank  matrices  by  model  (3).  We  can  establish  similar  translations  from  model  (12)  to  model 
(9)  using  both  inequalities  (51)  and  (52).  Hence,  we  present  the  low-rank  matrix  recovery  results  only  with 
the  parts  that  are  different  from  their  vector  counterparts. 

Paper  [33]  presents  the  NSP  condition  for  problem  (3):  all  matrices  X°  of  rank  r  or  less  can  be  exactly 
recovered  by  problem  (3)  from  measurements  b  =  A(X°)  if  and  only  if  all  H  €  Null(A)\{0}  satisfy 

r  m 

5>(H)<  £  ^(H).  (53) 

2=1  i=r+ 1 

We  can  extend  this  result  to  problem  (8)  by  applying  inequalities  (51)  and  (52). 

Theorem  7  (Matrix  NSP  condition).  Assume  that  ||X°||2  is  fixed.  Problem  (8)  uniquely  recovers  all  matrices 
X°  (with  the  specified  ||X°||2)  of  rank  r  or  less  from  measurements  b  =  A(X°)  if  and  only  if 

(l+^E^WS  E  ".(H)  (54) 

'  2=1  2=r+l 

holds  for  all  matrices  H  €  Null(M). 
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Proof.  Sufficiency:  Pick  any  matrix  X°  of  rank  r  or  less  and  let  b  =  A(X°).  For  any  nonzero  H  £  Null(_4), 
we  have  M(X°  +  H)  =  _4X°  =  b.  By  using  (51)  and  (52),  we  have 


> 


X°  4 

X° 


H|U  +  ^HX°  +  Hll f  >  ll«(X°)  ^  ^(H) || !  +  ^||S(X°)  -  S(H)||| 


S1 


|X' 


0 1|  2 


X  «*(H)  -  1 


i=r-\-l 


Hx°||2 

a 


i= 1 


7i(H) 


1 

2a' 


HI 


(55) 


where  the  second  inequality  follows  from  (19)  by  letting  h  =  —  s(H)  and  S  =  {l,...,r}  and  noticing 
hs  =  E[=i  °*(H)  and  hz  =  J2Zr+i  CTi(H)- 

For  any  nonzero  H  £  Null(4l),  KHH^  >  0.  Hence,  from  (55)  and  (54),  it  follows  that  X°  +  H  leads  to  a 
strictly  worse  objective  than  X°.  That  is,  X°  is  the  unique  solution  to  problem  (8). 

Necessity:  For  any  nonzero  H  €  Null(„4)  obeying  (54),  let  H  =  UXVT  be  the  SVD  of  H.  Construct 
X°  =  — USrVT,  where  Xr  keeps  only  the  largest  r  diagonal  entries  of  X  and  sets  the  rest  to  0.  Scale  X°  so 
that  it  has  the  specified  || X° || 2 -  We  have 


X°+tH||,  +  —  ||X°+fH|||  =  ||X° 
2a 


X  »<(*«)-  (i 


i=r-\- 1 


5>,:(tH) 


i—1 


for  any  t  >  0.  For  X°  to  be  the  unique  solution  to  (8)  given  b  =  4l(X°),  we  must  have 

X  M*H)-(l  +  Wfc)X».(<H) 

_i=r+ 1  ^  i= 1 


>  0 


for  all  t  >  0.  Hence,  (54)  is  necessary. 


□ 


Paper  [35]  introduces  the  following  RIP  for  matrix  recovery. 

Definition  3  (Matrix  RIP).  Let  A4r  :=  {X  £  Rnixn2  ;  rank(X)  <  r}.  The  RIP  constant  Sr  of  linear 
operator  A  is  the  smallest  value  such  that 


(l-<5r)||X|||<||M(X)||^<(l  +  ^)||X||| 


(56) 


holds  for  all  X  £  Mr. 

To  uniformly  recover  all  matrices  of  rank  r  or  less  by  solving  (3),  it  is  sufficient  for  A  to  satisfy  8§r  <0.1 
[35],  which  has  been  improved  to  the  RIP  with  d4r  <  \/2  —  1  in  [7]  and  to  <  0.307,  as  well  as  ones 
involving  <53r,  Siri  and  5sr,  in  [28].  The  algorithm  SVP  [26]  provably  achieves  exact  recovery  if  d-2r  <  1/3. 

Next,  we  present  a  stronger  RIP-based  condition  for  the  unsmoothed  problem  (3),  and  then  extend  it  to 
the  smoothed  problem  (8)  without  a  proof. 

Theorem  8  (RIP  condition  for  exact  recovery  by  (3)).  Let  X°  be  a  matrix  with  rank  r  or  less.  Problem  (3) 
exactly  recovers  X°  from  measurements  b  =  *4(X°)  if  A  satisfies  the  RIP  with  ^  <  0.4931. 

The  proof  is  a  straightforward  extension  to  the  arguments  in  [27];  the  interested  reader  can  find  it  in 
Appendix.  Next  we  present  the  result  for  the  augmented  model  (8). 

Theorem  9  (RIP  condition  for  exact  recovery).  Let  X°  be  a  matrix  with  rank  r  or  less.  The  augmented 
model  (8)  exactly  recovers  X°  from  measurements  b  =  *4(X°)  if  A  satisfies  the  RIP  with  S2r  <  0.4404  and 
in  (8)  a  >  10||X°||2. 
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Proof.  The  proof  of  Theorem  8  in  Appendix  establishes  that  any  H  £  Null(A)  satisfies  ||H0||*  <  d2r\\  £A>i  Hj||*. 
Hence,  (54)  holds  if  ^1  +  ^  >  d2r.  The  rest  of  the  proof  is  similar  to  that  of  Theorem  2.  □ 

Skipping  a  proof  similar  to  that  of  Theorem  3,  we  present  the  stable  recovery  result  as  follows. 

Theorem  10  (RIP  condition  for  stable  recovery).  Let  X°  £  R"lXn2  be  an  arbitrary  matrix  and  (Ji(X°)  be 
its  i-th  largest  singular  value.  Let  b  :=  M(X°)  +  n,  where  A  is  a  linear  operator  and  n  is  an  arbitrary  noise 
vector.  If  A  satisfies  the  RIP  with  62r  <  0.3814,  then  the  solution  X*  of  (9)  with  any  a  >  10-  ||X°||2  satisfies 
the  error  bounds: 


||X*  -  X°||*  <Ci  •  Vk\\n\\2  +  C2  ■  b(X°),  (57) 

||X*  -  X°||F  <C4  •  || n|| 2  +  (C2/VF)  ■  d(X°),  (58) 


where  <r(X°)  :=  X^i^r+i1’"2^  <T*(-^-°)  *s  the  best  rank-r  approximation  error  of  X° ,  C\,  C2,  C\,  and  C2  are 
given  by  formulas  (33a) -(34b)  in  which  92k  shall  be  replaced  by  02r  (given  in  (99)),  and 


=  OL  +  (7 1  (X°) 

3'  a  —  (Tfc+i(X°) 


and 


C4  := 


2a 

a  -  oy+i(X°)  ’ 


(59) 


respectively. 


Although  there  are  few  discussions  on  SSP  for  low-rank  matrix  recovery  in  the  literature  (cf.  [12]),  we 
present  two  SSP-based  results  without  proofs. 


Theorem  11  (Matrix  SSP  condition  for  exact  recovery).  Let  A  :  RniXra2  — y  Km  be  a  linear  operator. 
Suppose  there  exists  A  >  0  such  that  all  nonzero  H  £  Null(A)  satisfy 

l|H|U  ^  [m 
||H||f  -  V  a- 

Assume  that  ||X°||2  and  a  >  0  are  fixed.  If 

^>(2  +  ®!)'^  (60) 

then  the  null-space  condition  (54)  holds  for  all  H  £  Null(A).  Hence,  (60)  is  sufficient  for  problem  (8)  to 
recover  any  matrices  X°  of  rank  r  or  less  from  measurements  b  =  A(X°). 

Theorem  12  (Matrix  SSP  condition  for  stable  recovery).  Assume  that  linear  operator  A  :  R"lXn2  — >•  Rm 
has  the  same  property  as  it  is  in  Theorem  11.  Let  X°  £  R”ixn2  be  an  arbitrary  matrix.  Let  a  >  0  in  problem 
(8).  Define  C3  and  C4  in  (59),  which  depend  on  a.  If 


m  >  4  (1  +  C3)2  rA, 


then  the  solution  X*  of  (8)  satisfies 

|| X*  —  X°||*  <  4C4  •  <r(X°), 

where  cr(X°)  :=  X^i^r+i1’”2^  ^(X0)  is  the  best  rank-r  approximation  error  of  IK0. 


(61) 


(62) 
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4  Global  Linear  Convergence 

Now  we  turn  to  study  the  numerical  properties  of  the  linearized  Bregman  algorithm  (LBreg)  for  the  aug¬ 
mented  model  (5) .  In  this  section,  we  show  that  LBreg,  as  well  as  its  two  fast  variants,  achieves  global  linear 
convergence  with  no  assumptions  on  the  solution  sparsity  or  aforementioned  properties  of  matrix  A.  First, 
we  review  its  four  equivalent  forms  of  LBreg  that  have  appeared  in  different  papers.  We  start  off  with  the 


dual  gradient  descent  iteration  [39]:  give  a  step  size  h  >  0,  =  0,  and  k  starting  from  0, 

y  (fc+i)  ^  y(fc)  _  h  (_b  +  aA  shrink)  ATy(fe)))  .  (63a) 

The  last  term  of  (63a)  is  the  gradient  of  the  objective  function  of  problem  (10).  By  letting  x*A)  := 
a  shrink)  A  y*-^),  one  obtains  the  “primal-dual”  form 

x(fc+1)  <-  a  shrink)  ATy(fc)),  (63b) 

y(Hi)<_yW+/l(b_Ax('!+1)).  (63c) 

The  same  iteration  is  given  in  [40,  32,  3]  as 

x(fc+1)  <-  ashrink(v(fe)),  (63d) 

v(fc+i)  «_  v(fe)  +  hAT(b  -  Ax(fe+1)),  (63e) 

where  =  ATy(fc).  Finally,  the  name  “linearized  Bregman”  comes  from  the  iteration  [40] 

x(fe+1)  <-  argminDP(fc)(x,x^)  +  h( AT(Ax«  -  b),x)  +  -L||x  -  x(fc>|||,  (63f) 

p(fc+!)  ^  p(fc)  +  hAT(h  -  Ax^)  -  1  (X(fc+1^  -  xW),  (63g) 

a 

where  x®  =  p(0)  =  0  and  the  Bregman  “distance”  £)£(•,  •)  is  defined  as 

£>/(*,  y)  =  /(x)  -  /( y)  -  (p, x  —  y),  where  p  £  df( y). 


The  last  two  terms  of  (63f)  replace  the  term  |||Ax  —  b|||  in  the  original  Bregman  iteration.  Following  [40], 
one  can  obtain  (63d)-(63e)  from  (63f)-(63g)  by  setting  v^fcl  =  p*A)  +  /iAT(b  —  Ax^fcl)  +  . 

It  is  most  convenient  to  work  with  (63a)  due  to  its  simplicity  and  gradient-descent  interpretation.  In  the 
rest  of  this  section,  we  let  /( y)  be  the  objective  function  of  (10)  and  have  V/(y)  =  — b  +  aA  shrink(ATy). 

4.1  Preliminary 

In  this  subsection,  we  prove  a  few  key  results  that  will  be  used  to  prove  the  restricted  strongly  convex 
property  in  the  next  subsection. 

Definition  4.  Let  A^j^(S)  denote  the  minimum  strictly  positive  eigenvalue  of  a  nonzero  symmetric  matrix 
S,  assuming  its  existence.  Namely, 

Aii(S)  :=miu{Al(S):Ai(S)>0}, 
where  {Ai(S)}  is  the  set  of  eigenvalues  of  S. 

Lemma  4.  Let  A  be  a  nonzero  m-by-n  matrix.  Let  D  >-  0  be  an  n-by-n  diagonal  matrix  with  strictly  positive 
diagonal  entries.  We  have 

A+A(ADAt)=  min  (Aa)T(ADAT)(Aa).  (64) 

II  Aa||2  — 1 
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Proof.  Let  r  =  rank(A)  >  1.  Since  rank(ADA  )  =  r  and  ADA1  y  0,  ADA1  has  r  strictly  positive 
eigenvalues.  Let  A  >  0  be  a  positive  eigenvalue  and  x  be  its  corresponding  eigenvector.  Since  ADATx  = 
Ax,  we  see  x  £  Range(A)  and  can  thus  write  x  =  Aq^.  From  this  and  rank(A)  =  r,  the  eigenvectors 
corresponding  to  the  r  strictly  positive  eigenvalues  span  Range(A).  Hence,  (64)  attains  its  minimum  at  the 
eigenvector  Aa  corresponding  to  the  eigenvalue  A^(ADAT).  □ 

Next,  we  show  that  a  constrained  eigenvalue  problem,  which  will  appear  in  our  proof  of  restricted  strong 
convexity,  has  a  strictly  positive  minimum  objective. 

Lemma  5.  Let  A  be  a  nonzero  m-by-n  matrix,  B  be  an  m-by-l  matrix,  and  D  ^  0  be  a  diagonal  matrix  of 
size  n  by  n.  Let  r  :=  rank([A  B])  —  rank(A),  which  satisfies  0  <  r  <  I.  Let  c  and  d  be  free  vectors  of  sizes 
n  and  I,  respectively.  The  constrained  eigenvalue  problem 

v  :=  min  { (Ac  +  Bd)T(ADAT)(Ac  +  Bd)  :  ||  Ac  +  Bd|| 2  =  1,  BT(Ac  +  Bd)  <  0,  d  >  0  }  (65) 

satisfies  v  >  i>min  >  0,  where 

fmin  :=  min  {Aj(jA1(ADAT  +  CCT)  :  C  is  an  m-by-p  submatrix  of  B,  r  <  p  <  .  (66) 

(If  p  =  0,  C  vanishes.) 

Let  us  first  explain  this  lemma.  If  A  and  B  are  orthogonal  to  each  other  (i.e.,  A  B  =  0),  then 
Bt(Ac  +  Bd)  <  0  and  d  >  0  will  force  d  =  0  and  thus  reduce  (65)  to  (64).  In  this  sense,  the  lemma 
considers  the  general  case  where  A  and  B  may  not  be  orthogonal,  and  the  result  (66)  reveals  that  (65)  can 
go  lower  than  (64)  yet  must  remain  strictly  positive.  From  another  perspective,  if  we  ignore  the  constraints 
BT(Ac  +  Bd)  <  0  in  (65),  then  we  can  choose  c  and  d  >  0  such  that  AT(Ac  +  Bd)  =  0  and  thus  have  v  =  0. 
(For  example,  we  can  choose  any  d  >  0  so  that  Bd  ^  Range(A)  and  then  choose  c  so  that  —Ac  equals  Bd’s 
projection  on  Range(A).)  Therefore,  the  role  of  the  three  constraints  in  (65)  is  to  prevent  AT(Ac  +  Bd) 
from  being  0.  Those  constraints  will  arise  during  the  study  of  certain  KKT  systems. 

Proof  of  Lemma  5.  Let  B  =  [bi  b2  •  •  •  b^].  If  r  =  0,  then  rankQA  B])  =  rank(A)  and  thus  Ac  +  Bd  £ 
Range) A).  Since  dropping  the  constraints  BT(Ac  +  Bd)  <  0  and  d  >  0  from  (65)  does  not  increase  its 
optimal  objective,  we  have  v  >  A^(ADAT)  =  t;min  >  0  from  Lemma  4. 

Now  we  consider  the  nontrivial  case  r  >  0,  i.e.,  Range) [A  B] )  D  Range) A).  Ignoring  the  constraints 
Bt(Ac  +  Bd)  <  0,  we  can  choose  c  and  d  >  0  such  that  A  (Ac  +  Bd)  =  0  and  thus  v  =  0.  (See 
the  discussions  before  the  proof  for  example.)  Therefore,  the  rest  of  the  proof  focuses  on  the  role  of  these 
constraints. 

The  proof  is  based  on  induction.  We  will  show  later  that  as  long  as  Range([A  B])  D  Range(A),  any 
minimizer  (c*,d*)  of  (65)  makes  at  least  one  of  the  constraints  BT(Ac  +  Bd)  <  0  active.  (Since  (65)  has 
a  continuous  objective  over  a  compact  feasible  set,  it  has  a  minimizer.)  Without  loss  of  generality,  suppose 
this  active  constraint  is  (Ac*  +  Bd*)  =  0.  From  this,  we  obtain 

v  =  {Ac*  +  Bd*)T(ADAT)(Ac*  +  Bd*)  =  (Ac*  +  Bd*)T(ADAT  +  bib7)(Ac*  +  Bd*). 

We  move  bi  “from  B  to  A”  by  introducing  new  matrices  Ai  :=  [A  bi],  Bi  :=  |L>2  b3  •  ■  •  bf].  Introduce 
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so  (ADA1  +  bib^ )  =  (A1D1 ).  Furthermore,  drop  the  constraints  b^ (Ac*  +  Bd*)  <  0  and  d\  >  0,  and 
consider  the  resulting  problem 


v\  :=  min  <  (A1C1  +  Bidi)T(AiDi A^ )(AiCi  +  Brdi)  : 


ci,di 


||AiCi  +  Bidi  ||2  —  1, 

B^ (A1C1  +  Bidi)  <  0,  di  >  0 


Since  (67)  only  has  the  same  objective  when  b^ (Ac*  +  Bd*)  =  0,  we  conclude 

V>V\. 

We  apply  the  same  argument  to  (67),  and  then  to  the  subsequent  problems,  inductively:  let 

|A,-c,-  +  Bjdj||2  =  1; 


v-j  :=  mm 


(A-jCj  +  Bjdj)  (AjDjA  ■  )(AjCj  +  Bjdj)  : 


v?'-'  "  1  ■  Bj(AjCj  +  Bjdj)  <  0,  <1,  >  0 


(67) 


(68) 


(69) 


where  each  A j  =  [Aj_i  by] ,  B j  =  [b;/+1  •  •  •  b^],  and  Dj  = 


D,i  0 


,  for  j  =  2, 3, . . .  ,p  until  either  p  =  t 


0  1 

(i.e.,  “all  bj’s  have  been  moved  out  of  B”)  or  Range ([Aj,  Bp])  =  Range(Ap)  (i.e.,  the  condition  for  the 
induction  breaks  down  when  j  reaches  p).  The  former  case  occurs  if  r  =  £,  and  in  this  case,  we  obtain  empty 
B^  and  d^  and  thus 

vg  =  min{(A£c^)T(AfD£A<r)(Afcf)  :  ||Afcf||2  =  1}  . 


and  from  the  induction, 


V  >  V\  >  ■  ■  ■  >  V{,. 


From  AfD^Aj  =  ADA  :  +  BBT  and  Lemma  4,  it  follows 


=  A+i(ADAT+BB1 


The  latter  case  (i.e.,  j  =  p  <  l)  occurs  if  0  <  r  <  £.  In  this  case,  p  >  r  and  the  induction  gives  v  >  v\  >  •  •  •  > 
vp.  From  Range([Ap  Bp])  =  Range(Ap)  and  the  same  argument  at  the  beginning  of  this  proof,  we  have 
vp  >  A+^(ApDpA^).  By  the  definition  of  umin,  we  have  A+A(ApDpAj)  >  t>min  and  thus  v  >  vmin  >  0. 
Hence,  Lemma  5  is  proved  for  all  three  cases:  r  =  0,  0  <  r  <  £,  and  r  =  i. 

Finally,  we  establish  the  existence  of  an  active  constraint  by  showing  that  if  Range([A  B])  jb  Range(A), 
every  solution  of  the  problem  obtained  by  removing  the  constraints  BT(Ac  +  Bd)  <  0  from  (65),  namely, 

min  {(Ac  +  Bd)T(ADAT)(Ac  +  Bd)  :  ||  Ac  +  Bdj|2  =  1,  d  >  0}  ,  (70) 

c,d 

violates  BT(Ac  +  Bd)  <  0.  Since  Range([A  B])  D  Range(A),  as  been  argued  above,  one  can  choose  c  and 
d  >  0  such  that  Ac  +  Bd  G  Null(A)  and  thus  (Ac  +  Bd)T(ADA  1  )(Ac  +  Bd)  =  0.  (See  the  discussions 
before  the  proof  for  example.)  Therefore,  any  solution  (c,  d)  of  (70)  must  attain  the  0  objective,  so 

At(Ac  +  Bd)  =  0.  (71) 


Suppose 

Bt(Ac  +  Bd)  <  0.  (72) 

i.e.,  no  constraint  is  violated.  Then,  from  d  >  0,  (72),  and  (71),  it  follows 

dTBT(Ac  +  Bd)  <0, 
ctAt(Ac  +  Bd)  =0, 
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(73) 

(74) 


so 

||Ac  +  Bcl|| |  =  dTBT(Ac  +  Bd)  +  cTAT(Ac  +  Bd)  <  0, 
which  contradicts  the  constraint  ||Ac  +  Bd||2  =  1.  Therefore,  BT(Ac  +  Bd)  <  0  cannot  hold,  and  at  least 


one  of  these  constraints  must  be  violated.  Clearly,  this  argument  applies  to  problem  (69)  for  j  =  1,  2, . . . ,  as 
long  as  j  <  i  and  Range) [Aj  Bj])  3  Range  (A,).  □ 

Lemma  6.  Let  shrink  be  the  shrinkage  operator  shrink(s)  =  sign(s)  max{|s|  —  1,0}.  Then  the  following 
inequality 

(s  -  s*)  ■  (shrink(s)  -  shrink(s*))  >  |  ghri^kls*)!  +  9  '  ^  “  s*)2  -  0  (75) 

holds  for  Vs,  s*  £  R.  The  first  equality  holds  when  s  =  — sign(s*). 

Proof.  The  first  inequality  in  (75)  can  be  proved  by  elementary  case-by-case  analysis.  The  second  one  is 
trivial.  O 


4.2  Globally  Linear  Convergence 

In  this  subsection,  we  show  that  the  LBreg  iteration  (63a),  as  a  fixed-step  size  gradient  descent  iteration  for 
(10),  generates  a  globally  linearly  convergent  sequences  {yfc}  and  {xfc}. 

To  do  this,  we  need  the  following  theorem  from  [39]  with  our  modifications  for  better  clarity.  Below,  we 
use  the  notion 


slirink(z)  :=  shrinki(z)  =  z  —  Projr_1  ji„(z)  =  sign(z)  max{|z|  —  1,  0}, 
where  sign(-),  |  •  |,  and  max}-,  •}  are  component -wise  operations. 

Theorem  13.  Let  f  denote  the  objective  function  of  problem  (10),  and  x*  denote  the  solution  of  (5),  which 
is  unique  since  it  has  a  strictly  convex  objective.  Define  coordinate  sets  S+,S-,Sq  as  the  sets  of  positive, 
negative,  and  zero  components  of  x* ,  respectively.  Corresponding  to  S+,S-,So,  decompose 

A  =  [A+,A_,A0], 

X  =  [x_|_;  x_;  x0]. 

Then,  the  set  of  solutions  of  (10)  is  given  by 

y*  =  {y'  g  Rm  :  ashrin^AV)  =  x*}  (76a) 

=  {y,eMm:AXy,-l  =  a-1x;,  Aly7  +  1  =  a^x*  ,  —  1  <  A^y' <  1},  (76b) 

which  is  a  convex  set.  Furthermore,  V/(y')  =  0,  Vy7  €  y* . 

Proof.  Any  y'  £  y*  must  satisfy  the  strong  duality  condition,  namely,  the  primal  objective  equal  to  the  dual 
objective:  — /( y')  =  ||x*||i  +  A;||x*|||.  From  this  and  Ax*  =  b,  it  is  easy  to  derive  aslnin^A^7)  =  x* 
using  a  case-by-case  analysis  on  the  sign  of  x*.  Conversely,  since  V/(y)  =  — b  +  A(a  shrink(ATy))  and 
Ax*  =  b,  any  y'  obeying  aslnin^A1^')  =  x*  satisfies  V/( y')  =  0.  Then,  y'  £  y*. 

By  the  definition  (76b),  y*  is  a  polyhedron,  so  it  is  convex.  □ 

In  general,  the  two  sets  of  equality  equations  in  (76b)  do  not  define  a  unique  y*,  so  y*  can  include 
multiple  solutions. 
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A  typical  tool  for  obtaining  global  convergence  at  a  linear  rate  (or,  global  geometric  convergence)  is  the 
strong  convexity  of  the  objective  function.  A  function  g  is  strongly  convex  with  a  constant  c  if  it  satisfies 


(y-y',V/(y)  -  V/(y'))  >  c||y-y'||2,  Vy,y'  €  dom /. 


(77) 


Strong  convexity,  however,  does  not  hold  for  our  /( y)  since  V/(y*)  =  0,  Vy*  £  y* ,  while  y*  is  not 
necessarily  a  singleton.  Nevertheless,  we  establish  the  “restricted”  strong  convexity  (78)  below. 

Lemma  7  (Restricted  strong  convexity).  Consider  problem  (10)  with  a  nonzero  m-by-n  matrix  A  and 
nonzero  vector  b.  Assume  that  Ax  =  b  are  consistent.  Let  Projy.(y)  denote  the  Euclidean  projection  of  y 
to  the  solution  set  y* .  The  objective  function  f  of  (10)  satisfies 

(y  —  Projy.  (y),  V/(y))  >  u\\y  —  Proj-y,  (y)||2,  Vy,  (78) 


where  constant 


v  =  Aa 


mm 

i£supp(x* 


2  a 


>0, 


and  Aa  =  min  {A+V  (CTC)  :  C  is  a  nonzero  submatrix  of  A.  of  m  rows } . 


(79) 


Note  that  if  we  let  y'  =  Proj^y)  and  from  V/(y')  =  0,  (78)  becomes  (y  —  y',  V/(y)  —  V/( y'))  > 
v\\y  —  y' || 2 .  Hence,  (78)  is  the  restriction  of  (77)  to  the  specially  chosen  y'.  Yet,  this  will  be  enough  for 
global  linear  convergence. 


Proof  of  Lemma  7.  Since  Ax  =  b  are  consistent,  problem  (5)  has  a  unique  solution  x*,  so  y*  is  well-defined 
and  nonempty.  If  y  £  y* ,  then  y  =  Projy^y)  and  thus  (78)  holds  trivially.  To  show  (78)  for  y  fL  y* ,  we 
shall  consider 


min 


(y-y',v/(y)-v/(y')) 

(y-y'.y-y') 


y-y 


'  ^0,  y'  =  Projy. (y).j- 


(80) 


The  proof  is  divided  to  three  parts.  The  first  part  works  out  y'  =  Proj-y,  (y)  and  express  y  —  y'  in  terms  of 
submatrices  of  A.  The  second  part  establishes  (y  —  y',  V/(y)  —  V/(y')  >  (y  —  y')TM(y  —  y'),  where  M  ^  0 
also  depends  on  submatrices  of  A.  The  last  part  invokes  Lemma  5  to  obtain  (80).  Most  of  the  effort  is  to 
decompose  A  into  the  submatrices  and  understand  how  they  contribute  to  y  —  y'  and  V/(y)  —  V/( y'). 

Part  1.  By  definition,  y'  =  Projy*(y)  is  the  solution  of 


min 

y 


(81) 


Hence,  y'  satisfies  the  KKT  conditions  of  (81).  Using  the  expression  of  y*  in  (76b),  these  conditions  are 


y-y'  =  A+A+  +  A_A_  +  A0(u  —  £),  (82a) 

/  €  y,  (82b) 

l.u  >  0,  (82c) 

(1  ~  A^ y')Tu  +  (1  +  A^ y')T-£  =  0,  (82d) 


where  A_|_  and  A_  are  the  Lagrange  multipliers  for  the  two  equality  conditions  in  (76b)  and  t  and  u  are 
those  for  the  first  and  second  inequality  conditions  in  (76b),  respectively.  Equation  (82d)  is  the  so-called 
complementarity  condition,  which  together  with  (82c),  gives  the  following  three  cases  for  V*  £  Sq: 

£i  =  0,  Ui  =  0;  if  Ui  >  0,  then  A^ y'  =  1,  U  =  0;  if  U  >  0,  then  A^ y'  =  —1,  m  =  0.  (83) 
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Part  2.  Let  A±  =  [A+,  A_].  We  first  argue  that  A±  is  a  nonzero  submatrix  of  A.  Since  A  and  b  are 
both  nonzero,  the  solution  x*  to  problem  (5)  is  nonzero.  If  some  column  a,;  of  A  is  a  zero  vector,  then  Xi  is 
free  from  the  constraints  Ax  =  b  and  thus  x*  =  0.  Hence,  all  the  columns  of  A±  are  nonzero  vectors. 

From  V/(y)  =  — b  +  qA shrink(ATy)  and  0  =  V/(y')  =  b  +  qA skrink(ATy/),  we  obtain 


(y  -  y',  V/(y))  =  (y  -  y',  V/( y)  -  V/(y')>  =  «(ATy  -  ATy',  shrink(ATy)  -  shrink(ATy'))  (84a) 

=  aJAjy  —  Ajy',  shrink(Ajy)  —  shrink(Ajy'))  (84b) 

+  a(A([y-  Aq  y', shrink) Aq  y)  -  shrink) Ajy')).  (84c) 

By  definition,  every  component  of  shrink(Ajy/)  =  a_1xj  is  nonzero,  and  all  components  of  shrink(A(]  y')  = 
a_1Xg  are  zero.  For  this  reason,  we  deal  with  (84b)  and  (84c)  separately. 

Applying  inequality  (75)  to  (84b),  we  can  “remove”  the  “shrink”  operators  for  it  as 


a(Ajy  —  Ajy*,  shrink(Ajy)  —  shrink)  Ajy')}  = 


> 


a  (aj y-  aJy')  '  (shrink(ajy)  -  shrink(ajy')) 
i£S± 


ECU 


(ai  y  —  aJ y*)2 


a-1  x*  +2 


a(y  -  y')TA±DA}(y  -  y') 


(85) 


where  D  :=  diag 


(  A 

J  a~1\x*  |+2  J 


2(Esupp(x* ) 


>-  0.  Equation  (85)  along  is  not  enough  to  bound  (80)  from  zero  since 


A±  can  have  more  columns  than  rows  and  A±DA{  can  be  rank  deficient.  So,  we  need  to  include  (84c)  in 
they  analysis,  and  we  begin  with  a  decomposition  of  the  involved  matrix  Aq: 


A0  =  [Ai  A 2  A 3  A4  A5] 

according  to  the  criteria 

y  -  y' =  A±A±  +  AiUi  +  A2u2  -  A3€3  -  A4£4,  where  u4,  u2,£3,^4  >  0,  (86a) 

Ajy>+1,  (86b) 

Ajy<+1,  (86c) 

Ajy<-1,  (86d) 

Ajy  >  — 1.  (86e) 


Equations  (86)  mean  the  followings:  (i)  the  projected  point  y'  is  actively  confined  by  the  boundaries  of  y* 
involving  [A-|  A2  A3  A4]  (c.f.,  the  last  term  of  (76b));  (ii)  A5  does  not  contribute  to  y  —  y';  (iii)  by  applying 
(83)  and  (86b)-(86e),  we  get  Ajy'  =  1,  Ajy'  =  — 1  and  can  thus  simplify  the  components  of  (84c)  involving 
A4  and  A3  as  follows: 

shrink)  A  J  y)  —  shrink)  Aj  y')  =  shrink)  A  J  y)  =  Aj  y  1  =  Aj  y  —  Aj  y',  (87a) 

shrink)  A  J  y )  -  shrink(Ajy')  =  shrink(Ajy)  =  Ajy  +  1  =  Ajy  -  Ajy'.  (87b) 

Now  we  “drop”  the  components  of  (84c)  involving  A2,  A4,  and  A5  as  follows:  from  (75),  it  follows  that 
(Aj y  —  Aj y',  shrink(Aj y)  —  shrink(Aj y'))  >  0  for  i  =  2, 4,  5.  Hence, 

5 

a(Ajy  -  A  J  y',  shrink(Ajy)  -  shrink(Ajy'))  =a^(Ajy  -  Ajy',  shrink(Ajy)  -  shrink(Ajy')) 

2—1 

>a  (A J y  -  Ajy',  shrink) Aj y)  -  shrink) Aj y')) 

2=1,3 

=a(y  -  y')T(A4Aj  +A3Aj)(y-y').  (88) 
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Now  we  combine  (85)  and  (88).  Define  A  =  [A±  Ai  (— A3)],  c  =  [A±;  Ui;£3],  B  =  [A2  (— A4)],  d  =  [u3;£3], 

D  0 
0  I 


and  D  = 
get 


.  By  (86a),  we  have  y  —  y'  =  Ac  +  Bd  and  d  >  0.  Plugging  (85)  and  (88)  into  (84),  we 
(y  -  y',  V/(y)>  >  a(Ac  +  Bd)T(ADAT)(Ac  +  Bd)  (89) 


However,  (89)  is  still  not  enough  to  bound  (80)  from  zero  since  ADAT  may  still  be  rank  deficient. 

Part  3.  To  bound  (80),  we  now  include  the  “dropped”  parts  of  A  and  apply  Lemma  5.  From  (83),  we 
have  Ajy'  =  1  and  Ajy'  =  — 1,  and  further  from  (86c)  and  (86e), 


1  >  Ajy  =Ajy'  +  A J (y  -  y')  =  +1  +  Aj(Ac  +  Bd), 
1  <  Ajy  =Aj y'  +  Aj (y  -  y')  =  -1  +  Aj (Ac  +  Bd), 


or  written  compactly, 


BT(Ac  +  Bd)  <  0. 


Now  for  the  objective  of  (80),  we  apply  (89)  and  then  Lemma  5  to  obtain 


(90) 


(y-y',v/(y)) 

(y-y'.y-y') 


>  a  ■  min 


(Ac  +  Bd)T  (Ac  +  Bd) 


>  a  •  min{A++(ADAT  +  CCT)  :  C  is  an  m-by-p  submatrix  of  B,  P>  0} 


Note  that  under  our  convention,  an  m-by-0  matrix  vanishes.  Since  matrix  A  contains  the  nonzero  matrix 
A±  as  a  submatrix,  ADA  +  CTC  is  nonzero.  Therefore,  we  have 


— y  ’  >  a  .  (min(D)jj)  •  min{Ajtjt  (CTC)  :  C  is  a  nonzero  submatrix  of  A  of  m  rows} 

(y  -  y ,  y  -  y )  1  " - v - ^ 


Aa 


>  min 


iesupp(x*)  \x*\  +  2 Oi 


Aa 


□ 

Remark  4.  If  the  entries  of  A  are  in  general  positions,  i.e.,  any  m  distinct  columns  of  A  are  linearly 
independent,  or  in  other  words,  A  has  completely  full  rank  [24],  then  all  m-by-m  submatrices  of  A  have 
full  rank  and  thus  Aa  =  {Amin(CTC)  :  C  is  an  m-by-m  submatrix  of  A}.  This  is  often  the  case  when  those 
entries  are  samples  from  i.i.d.  subgaussian  distributions,  or  the  columns  of  A  are  data  vector  independent 
of  one  another.  In  general,  the  submatrix  C*  achieving  the  minimum  Aa  has  the  maximum  number  of 
independent  columns,  i.e.,  it  contains  r  columns  from  A  where  r  =  rank(A). 

With  the  restricted  strong  convexity  property,  we  next  show  the  main  convergence  result  with  the  help 
of  the  standard  notion  of  point-to-set  distance 

dist(z, Z)  :=  min{||z  —  z'||2  :  z'  £  Z}, 

z' 

where  z  is  a  vector  and  Z  is  a  set  of  vectors.  By  convention,  the  convergence  dist(z?l',  Z)  — >  0  is  called  globally 
Q-linear  if  there  exists  p  £  (0, 1)  such  that  dist(z k+1,Z)/ dist(zfc,  Z)  <  p  for  all  k,  and  the  convergence  sk  — >  0 
is  called  globally  R- linear  if  there  exists  a  globally  Q-linear  converging  sequence  tk  — >  0  such  that  |sfc|  <  \tk\. 
Unlike  Q-linear  convergence,  R- linear  convergence  does  not  require  |sfc|  to  be  monotonic  in  k. 
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Theorem  14.  Consider  problem  (10)  with  a  nonzero  m-by-n  matrix  A  and  nonzero  vector  b.  Assume  that 
Ax  =  b  are  consistent.  Let  f  be  the  objective  function  of  problem  (10)  and  f*  be  the  optimal  objective  value. 
The  linearized  Bregman  iteration  (63a)  starting  from  any  y*-0-*  G  Rm  with  step  size 

0  <  h  <  2i7(a2||A||4), 


where  the  strong  convexity  constant  v  is  given  in  (79),  generates  a  globally  Q-linearly  converging  sequence 

{y {k\k  >  1} 

dist(y(fc\;y*)  <  (l-2^  +  /i2a2||A||4)fe/2dist(y(°),3;*),  (91) 

where  y*  is  given  in  (76).  The  objective  value  sequence  convex  Q-linearly  as 

/(y(fe))-r<  ^(l-2^  +  ft2a2||A||t)fcdist2(y(°),r).  (92) 

Furthermore,  {x^'l}  is  a  globally  R-linear  converging  sequence  since 

Hx^+D  -  x*||2  <  a||A||2  •  dist(y«,r ).  (93) 

Proof.  For  each  k,  let  y :=  Projy. (y(*9).  Hence,  dist(y^fc\ 3^*)  =  ||y^  —  y/^fc-> ||2-  Using  this  projection 
property,  we  have 

l|y(fe+i)  -  y,(fe+1)l|2  <  l|y(fc+1)  -  y'(fe)||l  (94a) 

=  l|y(fe)  -  y,(fc)  -  * V/(y(fe)) III  (94b) 

=  ||y(fe)  -  y'(fc)||l  -  2/i(V/(yW),yW  -  y '«)  +  h2||V/( yW)  -  V/(y'(fc))|||  (94c) 

<  (1  -  2 hv)  ||y(fe)  -  y,(fc)||2  +  h2||aAshrink(ATy(fe))  -  aAshrink(ATy'(fc))||i  (94a) 

<  (1  -  2 hv)  ||y<*>  -  y'«||2  +  h2a2||  A||2||  ATyW  -  ATy'(fe)||2  (94e) 

<  (1  -  2 hv  +  h2a2||A||4)  ||y(fe)  -  y'(fc)||2  (94f) 


where  we  have  used  the  nonexpansive  property  of  the  shrinkage  operator  (cf.  [20]).  Hence,  we  obtain  (91). 

To  get  (92),  we  recall  for  any  convex  /  with  L-Lipschitz  V/,  /( y)  —  /(x)  <(V/(X),  y  — x)  +  f||x  — y||2 
(see  Theorem  2.1.5  in  [30]).  Let  y  =  y W  and  x  =  y'(fc).  We  have  /(y4fc))  =  /*,  V/( y'(U)  =  0  and  from 

(91), 

/(y(fe))-r  <  ^l|y(fc)  -  y'(fc)lll  <  \  (1  -  2^  +  /l2a2||A||4)fcdist2(y(°),r), 

which  shows  (92).  When  0  <  h  <  2r^/ (cr2 1| A||4),  we  have  (l  —  2 hv  +  h2a2||A||4)  <  1.  Due  to  (63b),  (76a), 
and  the  non-expansiveness  of  shrink(-),  we  get 


||x(fc+1)  -  x*  || 2  <  ||ashrink(ATy(U)  -  ashrink(ATy,W)||2  (95a) 

<  a||ATy(U  -  ATy'(fc)||2  (95b) 

<a||A||2.||y(fc)-y'«||2,  (95c) 

which  gives  (93).  D 


Remark  5.  If  we  set  h  =  iz/(a2||  A|||),  then  the  geometric  decay  factor  {l  —  2hv  +  h2a2\\A\\^)  =  (l  —  z^2/(cr2 1|  A  ||  2)) . 
Hence,  we  find  the  convergence  rate  affected  by  v,  a,  and  ||  A ||2 .  From  the  definition  of  v  in  (79),  we  get 


decay  factor  =  1 


v 

*2|AJ 


=  1  -  w2 


(96) 
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where 


uj  :=  mm 

i£supp(x*)  2  ■ 


K\/a 


k  :=  mm 


c?IA 

A++(CTC) 

^max  (AT  A) 


:  C  is  a  nonzero  submatrix  of  A  of  m  rows 


The  constant  n  is  similar  to  the  “ condition  number ”  of  A.  Lei  r*  =  (maxiGsupp(x»)  a:*)/(miniGsupp(x.)  a:*) 
denote  the  dynamic  range  of  x* .  If  we  set  a  =  C'||x*||00)  then 


u=  (1  +  2CY*)"1. 


Jor  recovering  a  sparse  vector,  recall  that  both  the  simulations  in  Section  3.1  and  the  analysis  in  Section  3 
show  that  if  x*  has  faster  decaying  nonzero  entries,  C  can  be  set  smaller.  So,  when  r*  is  large,  one  can 
choose  a  small  C  to  counteract. 

The  proved  rate  of  convergence  is  quite  conservative.  The  dependence  on  the  solution  dynamic  range  is 
due  to  (85),  which  considers  the  worst  case  of  (75),  yet  when  this  worst  case  happens,  the  inequality  between 
(94d)  and  (94e)  can  be  improved  due  to  properties  of  the  shrinkage  operator.  In  addition,  our  analysis  on 
the  global  rate  does  not  exploit  the  possibility  that  the  algorithm  may  reach  the  optimal  active  set  in  a  finite 
number  of  iterations  and  then  exhibit  faster  linear  convergence,  typically  at  a  rate  depending  only  on  the 
active  set  of  columns  of  A  and  independent  of  the  solution’s  dynamic  range. 

The  step  size  h  <  2z'/(or|j A^)  is  also  very  conservative.  As  one  will  see  in  the  simulation  results  in 
the  next  section,  classical  techniques  for  gradient  descents  such  as  line  search  can  significantly  accelerate  the 
convergence. 


4.3  Extensions  to  two  faster  variants  of  LBreg 

We  extend  the  linear  convergence  results  to  two  variants  of  LBreg  (iteration  (63))  that  can  run  significantly 
faster  than  LBreg:  BB-line-search  [39]  and  kicking  [32].  The  former  dynamically  sets  the  step  size  h  in  (63) 
by  the  Barzilai-Borwein  method  with  nonmontone  line  search  using  techniques  from  [42].  The  latter  is  a 
simple  add-on  to  iteration  (63)  to  consolidate  a  sequence  of  consecutive  iterations  in  which  xfe  is  unchanged. 
If  xfc  =  ■  •  •  =  x.k+l ,  [32]  shows  that  yfc, . . . ,  yk+ 1  stay  on  the  same  line,  so  it  is  easy  to  skip  all  the  intermediate 
iterations  and  go  directly  to  the  end  of  the  line. 

Obviously,  since  kicking  only  skips  certain  LBreg  iterations,  it  remains  have  global  linear  convergence. 
On  the  other  hand,  given  strong  convexity,  Theorems  3.1  and  3.2  of  [42]  shows  that  BB-line-search  also 
enjoys  global  linear  convergence  (though  the  results  are  weakened  to  the  R-linear  convergence  of  /(y —  f* 
and  Ax^  —  b  in  our  case);  it  is  not  difficult  to  verify  that  the  proof  of  the  theorem  remains  to  hold  given 
only  restricted  strong  convexity.  For  space  consideration,  we  do  not  detail  the  steps  in  this  paper. 


4.4  Numerical  Demonstration 

We  present  the  results  of  simple  tests  to  demonstrate  the  convergence  of  three  algorithms:  the  original 
LBreg  iteration  (63),  kicking  [32],  and  BB-line-search  [42,  39].  Their  numerical  efficiency  and  properties 
have  been  previously  studied  in  papers  [32,  38,  22]  and  are  not  the  focus  of  this  paper,  so  we  merely  use  two 
examples  to  illustrate  global  linear  convergence.  We  generated  two  compressive  sensing  tests  where  both 
tests  had  signals  x°  with  512  entries,  out  of  which  50  were  nonzero  and  sampled  from  the  standard  Gaussian 
distribution  (for  Figure  2)  or  the  Bernoulli  distribution  (for  Figure  3).  Both  tests  had  the  same  sensing 
matrix  A  with  256  rows  and  entries  sampled  from  the  standard  Gaussian  distribution.  We  set  a  =  10|jx°||oo 
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in  each  test  and  stopped  all  the  three  algorithms  upon  || V/(y) || 2  <  10  6.  The  iterative  errors  ||xfc  —  x*||2 
and  ||yfc  —  y*||2  of  the  three  algorithms  are  depicted  in  Figures  2  and  3.  In  both  tests,  the  original  version 
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(a)  £2  error  of  primal  variable  x^) 


(b)  £2  error  of  dual  variable  y(fe) 


Figure  2:  Convergence  of  primal  and  dual  variables  of  three  algorithms  on  Gaussian  sparse  x° 


(a)  i 2  error  of  primal  variable  x^)  (b)  £2  error  of  dual  variable  y 1 k 1 

Figure  3:  Convergence  of  primal  and  dual  variables  of  three  algorithms  on  Bernoulli  sparse  x° 

was  the  slowest.  Besides  the  obvious  speed  differences,  we  observe  that  {x^}  were  not  monotonic,  there 
were  sets  of  consecutive  iterations  in  which  x^fc)  did  not  change  or  fluctuated.  Indeed,  it  is  impossible  to 
improve  its  R-linear  convergence  to  Q-linear  convergence.  In  addition,  unlike  the  other  two  algorithms, 
BB -line- search  has  non- monotonic  {y^},  which  converges  R-linearly  instead  of  Q-linearly. 

The  convergence  appears  to  have  different  stages.  The  early-middle  stage  has  much  slower  convergence 
than  the  final  stage. 

Comparing  the  results  of  two  tests,  the  convergence  was  faster  on  the  Bernoulli  sparse  signal  than  the 
Gaussian  sparse  signal.  Since  the  two  tests  used  the  same  sensing  matrix  A  and  the  same  sparsity,  the  main 
reason  should  be  the  dynamic  range  of  the  signals.  A  smaller  dynamic  range  leads  to  faster  convergence, 
which  matches  our  theoretical  result  on  the  convergence  rate. 


25 


Appendix 


Proof  of  Theorem  8.  We  establish  the  theorem  by  showing  that  (53)  holds  for  any  H  £  Null(„4)  \  {0}. 

Based  on  the  SVD  H  =  E™ i  cri(H)u;vir,  where  er^H)  is  the  i-th  largest  singular  value  of  H,  we 
decompose  H  =  H0  +  Hi  +  H2  +  •  •  •  where  H0  =  E(=i  ai(H)uivi,  Hi  =  Eilr+i  °i(H)uiVj,  H2  = 
••••  Following  these  definitions,  condition  (53)  can  be  equivalently  written  as 

||Ho||.<||£h<||..  (97) 

i>  1 


From  H  ^  0  and  the  definition  of  Ho,  we  know  that  Ho  ^  0  and  thus  A  (H0)  7^  0  due  to  the  RIP  of 
A.  From  .A(H)  =  0  and  -4(H0)  7^  0,  it  follows  that  -A(Ei>i  Hj)  7Z  0  and  thus  E;>i  H*  7^  0.  Therefore, 
Ei>  1  l|Hj||*  >  0,  and  we  can  define  t  :=  ||H1||*/(X)i>1  ||Hj||*)  >  0  and  p  :=  ||H0||*/(Ei>i  HH^*)  >  0. 

Next,  we  present  two  inequalities  without  proofs  (the  interested  reader  can  verify  them  following  the 
proofs  of  Lemmas  2.3  and  2.4  in  [27]): 


1  &2 r  ,  2  1  ,2 


(P2  +  *2)  EllH*H*  <  ||^(Ho  +  H1)||l, 


k  i>  1 


f(l  -t)  +  52r(l  -  3f/4)2 


EiiHiii*  >pcEh 


ij\  12- 


,  i>  1 


i>  2 


(98a) 

(98b) 


Since  -d(H0  +  HO  +  A  (Ei>2  H>)  =  -4(H)  =  0,  the  two  right-hand  sides  of  (98)  equal  each  other.  Hence, 


1  &2r  (  2  1  4.2 


(p2+f2)  EllHd 


t(  1  -  t)  +  <52r(l  -  3f/4)2 


ElIH* 


.  i>  1 


.  i>  1 


and  thus, 


2  £(1  -  t)  +  S2r(  1  -  3f/4)2  -  (1  -  52r)t2 

P"  < - 


1  —  S2r 

or  after  a  simple  calculation  of  the  maximum  of  t  £  [0, 1], 


P  < 


'4(1  +  5S2r  —  4(t52r.)2) 
(l-<52r.)(32  -  25ci2r) 


(99) 


If  S2r  <  (77  —  Vl337)/82  «  0.4931,  then  d2r  <  1  and  thus  p  <  1.  By  definition,  we  get  (97)  and  (53).  □ 
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